The goal of this market segmentation is to develop user segments, or “personas”, to inform future marketing efforts and product development. This analysis uses an unsupervised machine learning approach to cluster users into distinct personas. Unsupervised clustering offers a data-driven way to infer structure (clusters) within the data. The resulting clusters can then be interpreted to understand how Duolingo users are similar to or different from each other. Because much of the survey data was collected as categorical variables, this analysis uses the k-modes clustering algorithm; since k-modes can only handle categorical variables, some numerical data was converted into categories. K-modes works by iteratively comparing each data point with k centroids (the cluster modes). Each point is assigned to the cluster whose centroid it is most similar to, and a new centroid is then calculated. The distance between a centroid and a point is measured by dissimilarity, the total number of mismatched attributes between the two.
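The assignment and update steps described above can be sketched in base R. This is a minimal illustration on made-up data, not the code used in the analysis; the column names, values, starting centroids, and helper functions (`mismatch_count`, `assign_clusters`, `update_centroids`) are all hypothetical.

```r
# Simple-matching dissimilarity: number of attributes on which two records differ.
mismatch_count <- function(x, y) sum(x != y)

# Hypothetical toy data: three categorical survey answers per user.
users <- data.frame(
  age        = c("18-34", "18-34", "55-74", "55-74"),
  subscriber = c("No", "No", "Yes", "Yes"),
  commitment = c("Moderate", "Moderate", "Very", "Moderate"),
  stringsAsFactors = FALSE
)

# Two hypothetical starting centroids (modes), one per cluster.
centroids <- users[c(1, 3), ]

# Assign every user to the centroid with the fewest mismatches.
assign_clusters <- function(data, centroids) {
  sapply(seq_len(nrow(data)), function(i) {
    d <- sapply(seq_len(nrow(centroids)), function(k) {
      mismatch_count(unlist(data[i, ]), unlist(centroids[k, ]))
    })
    which.min(d)
  })
}

# Recompute each centroid as the per-column mode of its members.
column_mode <- function(v) names(which.max(table(v)))
update_centroids <- function(data, clusters) {
  do.call(rbind, lapply(sort(unique(clusters)), function(k) {
    as.data.frame(lapply(data[clusters == k, , drop = FALSE], column_mode),
                  stringsAsFactors = FALSE)
  }))
}

clusters  <- assign_clusters(users, centroids)
centroids <- update_centroids(users, clusters)
```

In a full k-modes run these two steps would be repeated until the cluster assignments stop changing.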
The data consists of user survey and usage data from Duolingo. The survey asked users a series of questions about demographics (e.g., country, age, employment status) and motivation (e.g., primary reason for studying a language). The goal of this survey was to develop user segments (or personas) to inform future marketing efforts and product development. Usage data was collected from August 1, 2018 to November 5, 2018.
Exploratory data analysis revealed that a significant portion (57%) of the daily goal values were NAs. We decided to drop this column, rather than the affected rows, to preserve as much data as possible. Over 50% of users from four countries (MX, FR, JP, RU) have purchased a subscription. Interestingly, these four countries also have the lowest percentage of users in the 0-10,000 salary range and a high number of lessons completed. However, for other variables such as commitment, employment status, and age, it is less clear whether these geographies differ from the rest. Additionally, we found that those who purchase a subscription also use the app more.
High Value Customer
- Most likely to purchase a subscription
- Very active with the Duolingo app
- Very committed to learning
- High proportion of retirees
- Generally older (55-74)
- Generally earn more
- Mixed language proficiency

New Language Students
- Least likely to purchase a subscription
- Generally younger (18-34)
- Most earn less than 10k
- Learning a language for the first time
- Highest probability of being a student or unemployed

Working Adult Reviewer
- Reviewing a language they have studied before
- Generally middle-aged (35-54)
- Generally earning $26k-$75k
- Most likely to take a placement test
- Highest employment rate
Recommendations for product changes and marketing campaigns
- High Value Customer
  o Consider developing a loyalty and referral program targeted at this group. Highlight the referral scheme, as word of mouth is the best way to win new customers.
  o Provide dedicated service representatives in case they have issues.
- New Language Students
  o A young group of new language learners. Consider targeted campaigns that expose them to multiple new languages to help them discover one that interests them.
  o Appeal to young people’s desire to experience new languages with travel-focused marketing campaigns.
- Working Adult Reviewer
  o Most likely to review a language studied before; target notifications and marketing campaigns at relearning an old language.
  o Most likely to be working a job; consider sending notifications after work hours.
Daily goal has more than 50% of its values as NAs, so we chose to remove this feature.
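A check of this kind can be sketched as follows. The data frame and the column name `daily_goal` are assumptions made for illustration; the actual column name in the dataset may differ.

```r
# Hypothetical stand-in for the usage data; `daily_goal` is an assumed column name.
df_example <- data.frame(
  daily_goal    = c(NA, 10, NA, NA, 20),
  n_active_days = c(3, 12, 7, 1, 30)
)

# Share of missing values per column.
na_share <- colMeans(is.na(df_example))

# Keep only columns where at most half of the values are missing.
df_kept <- df_example[, na_share <= 0.5, drop = FALSE]
```

Dropping the column rather than the rows with missing values keeps every user in the data set for the remaining features.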
Longest_streak has a group of individuals with abnormally high values. This could be a technical issue with how the data was collected, so we chose to remove this abnormal group. Other features, such as n_lessons_started and n_days_on_platform, look as we expect.
What is the distribution of time spent completing the survey? We do not want survey results that are inaccurate, so we plotted a histogram of time spent on the survey. We chose to remove users who did not spend at least 100 seconds (a value of 2 on the log10 scale) filling out the survey.
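The filter can be sketched as below. The data frame and the column name `time_spent_seconds` are hypothetical placeholders, not names taken from the survey data.

```r
# Hypothetical survey data; `time_spent_seconds` is an assumed column name.
df_survey_example <- data.frame(
  user_id            = 1:5,
  time_spent_seconds = c(12, 95, 100, 340, 1800)
)

# On the log10 scale used for the histogram, 100 seconds corresponds to 2.
df_survey_example$log10_time <- log10(df_survey_example$time_spent_seconds)

# Keep respondents who spent at least 100 seconds on the survey.
df_survey_kept <- df_survey_example[df_survey_example$time_spent_seconds >= 100, ]
```

Filtering on the raw seconds rather than on the log-transformed value avoids any floating-point edge cases exactly at the threshold.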
As expected, those who purchase a subscription are more active.
The heatmap shows that number of active days, lessons started, lessons completed, and highest crown count correlate with each other. The overall trend is that most of these app usage features are positively correlated with each other.
# Check correlations. To compute correlations, we encode the binary variables purchased_subscription and took_placement_test as 0/1 integers
df_usage$purchased_subscription <- as.integer(as.logical(df_usage$purchased_subscription))
df_usage$took_placement_test <- as.integer(as.logical(df_usage$took_placement_test))
df_corr <- df_usage[,c("highest_course_progress", "took_placement_test", "purchased_subscription", "highest_crown_count",
"n_active_days","n_lessons_started","n_lessons_completed","longest_streak","n_days_on_platform")]
df_corr <- na.omit(df_corr)
### Correlation matrix, rounded to two decimals
cormat <- round(x = cor(df_corr), digits = 2)
### Get lower triangle of the correlation matrix
get_lower_tri <- function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}
### Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
## highest_course_progress took_placement_test
## highest_course_progress 1 0.18
## took_placement_test NA 1.00
## purchased_subscription NA NA
## highest_crown_count NA NA
## n_active_days NA NA
## n_lessons_started NA NA
## n_lessons_completed NA NA
## longest_streak NA NA
## n_days_on_platform NA NA
## purchased_subscription highest_crown_count
## highest_course_progress 0.19 0.66
## took_placement_test 0.07 0.19
## purchased_subscription 1.00 0.29
## highest_crown_count NA 1.00
## n_active_days NA NA
## n_lessons_started NA NA
## n_lessons_completed NA NA
## longest_streak NA NA
## n_days_on_platform NA NA
## n_active_days n_lessons_started n_lessons_completed
## highest_course_progress 0.37 0.26 0.27
## took_placement_test 0.07 0.13 0.13
## purchased_subscription 0.36 0.33 0.33
## highest_crown_count 0.55 0.52 0.52
## n_active_days 1.00 0.50 0.50
## n_lessons_started NA 1.00 0.98
## n_lessons_completed NA NA 1.00
## longest_streak NA NA NA
## n_days_on_platform NA NA NA
## longest_streak n_days_on_platform
## highest_course_progress 0.34 0.45
## took_placement_test 0.02 -0.07
## purchased_subscription 0.25 0.10
## highest_crown_count 0.51 0.35
## n_active_days 0.47 0.17
## n_lessons_started 0.27 0.01
## n_lessons_completed 0.27 0.01
## longest_streak 1.00 0.28
## n_days_on_platform NA 1.00
### Melt (melt() comes from reshape2; the heatmap below needs ggplot2)
library(reshape2)
library(ggplot2)
melted_cormat <- melt(upper_tri, na.rm = TRUE)
### Heatmap
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()

MX, FR, JP and RU have extremely high rates of subscription payments. Their users also share high levels of commitment to learning the language and have similar age profiles (a relatively high proportion of 55-74 year olds). None of these four countries is natively English-speaking. Germany (DE) is another non-English-speaking country with similar age demographics and high levels of commitment but, despite this, has a low proportion of subscribers. The DE market could be an untapped market. One strategy could be to lower the cost of subscriptions in Germany to ease the point of entry.
German users include far fewer 151k+ earners than users in MX, FR, JP and RU. Consider lowering the subscription cost in this region.
Most of the survey data is categorical, so we use k-modes to cluster it. The goal of this clustering is to identify distinct groups within the data set. First we build our data frame, and then we use an elbow plot to determine the optimal number of clusters k. From the elbow plot we select k = 3.
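An elbow plot of this kind can be sketched as follows. This is a hand-rolled base-R k-modes on synthetic data, written only to illustrate the idea; it is not the implementation used in the analysis (packages such as klaR provide a `kmodes()` function). The function `simple_kmodes_cost`, the data, and all column names are hypothetical.

```r
# Minimal k-modes sketch: returns the total number of mismatches between
# every row and its closest cluster mode after at most `iter_max` updates.
simple_kmodes_cost <- function(data, k, iter_max = 10) {
  m <- as.matrix(data)
  n <- nrow(m)
  uniq <- unique(m)
  # Start from k distinct rows chosen at random (requires k <= nrow(uniq)).
  modes <- uniq[sample(nrow(uniq), k), , drop = FALSE]
  # n x k matrix of mismatch counts between each row and each mode.
  mismatches <- function(modes) {
    vapply(seq_len(k), function(j) {
      rowSums(m != matrix(modes[j, ], n, ncol(m), byrow = TRUE))
    }, numeric(n))
  }
  cluster <- integer(n)
  for (iter in seq_len(iter_max)) {
    new_cluster <- max.col(-mismatches(modes), ties.method = "first")
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    for (j in seq_len(k)) {
      members <- m[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) {
        # New mode: most frequent category in each column.
        modes[j, ] <- apply(members, 2, function(v) names(which.max(table(v))))
      }
    }
  }
  sum(apply(mismatches(modes), 1, min))
}

set.seed(42)
n <- 300
group <- sample(1:3, n, replace = TRUE)  # hypothetical latent segments

# Numeric usage data binned into categories, since k-modes needs categorical input.
n_lessons <- rpois(n, lambda = c(10, 40, 90)[group])
df_cluster <- data.frame(
  age           = c("18-34", "35-54", "55-74")[group],
  subscriber    = ifelse(group == 3, "Yes", "No"),
  n_lessons_cat = as.character(cut(n_lessons, breaks = c(-Inf, 20, 60, Inf),
                                   labels = c("1", "2", "3"))),
  commitment    = sample(c("Moderate", "Very"), n, replace = TRUE),  # noise
  stringsAsFactors = FALSE
)

# Total within-cluster mismatches for each candidate k.
ks <- 1:6
total_withindiff <- sapply(ks, function(k) simple_kmodes_cost(df_cluster, k))

plot(ks, total_withindiff, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster mismatches")
```

The "elbow" is the value of k after which adding more clusters yields only small reductions in the total mismatch count. Because the starting modes are random, repeated runs can give somewhat different curves.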
The modes of our clusters
## age annual_income employment_status student
## 1 35 - 54 $26,000 - $75,000 Employed full-time Not currently a student
## 2 18-34 $0 - $10,000 Employed full-time Not currently a student
## 3 55 - 74 $26,000 - $75,000 Employed full-time Not currently a student
## duolingo_subscriber
## 1 No, I have never paid for Duolingo Plus
## 2 No, I have never paid for Duolingo Plus
## 3 Yes, I currently pay for Duolingo Plus
## primary_language_commitment
## 1 I'm moderately committed to learning this language.
## 2 I'm moderately committed to learning this language.
## 3 I'm very committed to learning this language.
## primary_language_review
## 1 I am using Duolingo to review a language I've studied before.
## 2 I am using Duolingo to learn this language for the first time.
## 3 I am using Duolingo to learn this language for the first time.
## primary_language_proficiency took_placement_test n_lessons_completed_cat
## 1 Intermediate 1 2
## 2 Beginner 0 1
## 3 Beginner 1 3
## purchased_subscription
## 1 0
## 2 0
## 3 1
A radar chart summarizes the key attributes of each cluster.
A 3D plot built with plotly shows our 3 clusters by course progress, purchased_subscription, and active days.